Closed
Bug 595884
Opened 14 years ago
Closed 14 years ago
JM: make f.apply(x, arguments) fast
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
People
(Reporter: jandem, Assigned: luke)
References
Details
Attachments
(1 file, 15 obsolete files)
26.31 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
As discussed in bug 594972, we need this for the Class.create pattern in v8-raytrace. This makes up for at least 100 ms on awfy, and speeding this up will also help many other benchmarks and websites.
Which aspect of |arguments| do you think is costing the most? Passing it to fun.apply, simply creating it, or accessing members of it?
Reporter | ||
Comment 2•14 years ago
|
||
(In reply to comment #1) > Which aspect of |arguments| do you think is costing the most? Passing it to > fun.apply, simply creating it, or accessing members of it? From the numbers in bug 594972 it looks like arguments[i] (ARGSUB) is pretty fast. "return arguments" is much slower than v8. Assigning it to a global variable or passing it to a function is also slow, so I assume creating the arguments object is slow. I can profile when I get home.
Comment 3•14 years ago
|
||
The prototypejs.org-style passing arguments only to fun.apply (Class initialize delegated application in v8 raytrace) is juicy, because we don't really need to create an arguments object. Never creating beats optimizing creation costs any day. /be
Reporter | ||
Comment 4•14 years ago
|
||
This patch adds a fast path for f.apply(x, arguments). It makes v8-raytrace more than twice as fast (415 ms -> 204 ms for me) One remaining TODO is adding support for call IC's; this patch generates just uncached calls, and some parts are hackish. I'd appreciate some feedback on how to fix this properly. Btw, I won't have much time next week to hack on this; feel free to steal it.
Comment 5•14 years ago
|
||
Great work! Cc'ing potential stealers. /be
Reporter | ||
Comment 6•14 years ago
|
||
Added some helper functions to make things easier to read. Also fixes the case where f is not a function in f.apply.
Attachment #475891 -
Attachment is obsolete: true
Attachment #475891 -
Flags: feedback?(dvander)
Reporter | ||
Comment 7•14 years ago
|
||
There's one jstest failure. It can be reduced to this: ----- function f() { return print.apply(null, arguments); } res = f(1, 2, 3); print(res); ----- This prints: ---- 1 2 3 function apply() { [native code] } ---- This happens for all builtins. For non-native functions everything works fine. What do native functions do differently?
Reporter | ||
Comment 8•14 years ago
|
||
Further cleanup. The bug I mentioned is still there.
Attachment #476206 -
Attachment is obsolete: true
Comment 9•14 years ago
|
||
Natives get the callee pushed on the interpreter stack slot where the return value should go on return. So a classic native function implementation mistake is to forget to JS_SET_RVAL before returning and the called native returns itself. Something in JM needs to track the apply'ed r.v. as copied there? /be
Comment 10•14 years ago
|
||
(In reply to comment #9) > Natives get the callee pushed on the interpreter stack slot where the return > value should go on return. So a classic native function implementation mistake > is to forget to JS_SET_RVAL before returning and the called native returns > itself. Something in JM needs to track the apply'ed r.v. as copied there? > > /be There are (or were, a few months ago) a fair number of these bugs still around. Developing the type inference was great at finding them, because it checks the actual returned value of all calls against the type specification. The inference patch should fix the ones I found.
Reporter | ||
Comment 11•14 years ago
|
||
@brendan: thanks, that made me realize what went wrong. This patch fixes the test and passes trace-tests and jstests. Now we need to fix the remaining hacks. It may be easier to steal now that it passes all tests..
Attachment #476233 -
Attachment is obsolete: true
Reporter | ||
Comment 12•14 years ago
|
||
I remember bhackett talking on IRC about many calls to Function.prototype.call in v8-deltablue. I think this could be fixed easily with this patch. For f.call(x, y) the stack looks like this:call, f, x, y. testNativeFunction can check for builtin Function.prototype.call. If it's not, just pass argc to emitUncachedCallVarArgs. If it *is* builtin Function.prototype.call, increment the argc register, so that we call f with x as this-value. Maybe we don't even need this as the number of arguments is not really variable. Could be worth a try though.
Comment 13•14 years ago
|
||
The Function.call thing on IRC turned out to be mostly incompetence in my measuring abilities. deltablue calls Function.call about 125k times, which works out to a loss of ~20ms (not nothing, but not huge) vs JSC. There are about 2 million calls to Invoke though (which I was confusing with Function.call), which look to be coming from non-Function.call slow paths and which are probably costing significantly (haven't looked further).
Reporter | ||
Comment 14•14 years ago
|
||
Cleanup and a small improvement. Now we use a single lea instruction to calculate the dynamic stack pointer instead of a shift and two adds. With this patch the micro-benchmark in bug 594972 becomes almost 3x faster. v8-raytrace still shows a 2x speedup (~200 ms on v8). Anybody willing to take over and drive this in?
Attachment #476403 -
Attachment is obsolete: true
Comment 15•14 years ago
|
||
I can take a look tomorrow. What's left -- I see various "XXX" marks in the patch, is there anything other than those?
Comment 16•14 years ago
|
||
I like the idea of function specific optimisations. I think it might be worthwile in real world applications to fastpath Array.prototype.slice.call(arguments, 1) as this is most common approach from my point of view to convert arguments into an array.
Comment 17•14 years ago
|
||
(In reply to comment #15) > I can take a look tomorrow. What's left -- I see various "XXX" marks in the > patch, is there anything other than those? I just looked at this patch for over an hour... I've barely looked at JM prior to now, and I barely understand what's going on. And for several of the "XXX" marks I can't even tell what the shortcoming is. Sorry, I'm not much use here.
Updated•14 years ago
|
blocking2.0: --- → ?
Reporter | ||
Comment 18•14 years ago
|
||
Sorry Nick, I will add some comments and attach a new version of the patch monday morning PDT.
Reporter | ||
Comment 19•14 years ago
|
||
(In reply to comment #18) > Sorry Nick, I will add some comments and attach a new version of the patch > monday morning PDT. While adding comments I realized that I can rewrite some stuff to make it a bit cleaner, I will try this approach tomorrow. Please don't review the current patch, it will change quite a bit.
Reporter | ||
Comment 20•14 years ago
|
||
Something like this. I added some comments and changed a few things. @dvander: I know there are more important bugs right now. Just requesting feedback so you won't forget this bug.
Attachment #476515 -
Attachment is obsolete: true
Attachment #477252 -
Flags: feedback?(dvander)
Comment on attachment 477252 [details] [diff] [review] WIP v6 >+mjit::Compiler::emitUncachedCallVarArgs(Assembler& masm, RegisterID varArgcReg, int nonVarArgc) >+{ This is a great start. We could cache the call with more work, but this will already be much faster. >+void >+mjit::Compiler::testNativeFunction(FrameEntry *fe, void *funcPtr, JumpList& notFunction) Comments on this at the end. >+ /* Check if receiver is a function. >+ * XXX: is testObject redundant after callprop? */ >+ FrameEntry *receiver = frame.peek(-2); >+ RegisterID dataReg = frame.copyDataIntoReg(receiver); >+ Jump notObject = frame.testObject(Assembler::NotEqual, receiver); >+ Jump notFunction = masm.testFunction(Assembler::NotEqual, dataReg); CALLPROP doesn't guarantee a function or even an object, though it probably won't get very far without one. So your checks here are correct. Maybe we should have a follow-up bug on eliminating them in the fast-path, but it probably wouldn't buy a lot for the added complexity. >+ >+ /* Check for builtin apply. */ >+ JumpList notApply; >+ testNativeFunction(frame.peek(-3), JS_DATA_TO_FUNC_PTR(void *, js_fun_apply), notApply); If we're compile-and-go (which content scripts are), there's an alternative to all this testing: 1. At compile time, lookup "apply" on our global object's Function.prototype. 2. If "apply" is an function-object whose JSFunction is js_fun_apply, bake in this check: cmp receiverType, OBJECT_TAG cmp receiverData, <funobj ptr> Another option is to layout JSFunction such that |native| overlays |script|. You wouldn't have to check flags and you're guaranteed the pointer is enough. You'll still end up with more checks, but but either way for scripts that will be shared across multiple globals, we'll have to do something a little slower.
Attachment #477252 -
Flags: feedback?(dvander) → feedback+
Updated•14 years ago
|
Assignee | ||
Comment 22•14 years ago
|
||
Comment on attachment 477252 [details] [diff] [review] WIP v6 >+int JS_FASTCALL >+stubs::PushArguments(VMFrame &f) ... >+ Value lenval; >+ if (!js_GetArgsProperty(cx, fp, ATOM_TO_JSID(rt->atomState.lengthAtom), &lenval)) >+ THROWV(0); >+ >+ if (lenval.isInt32()) { >+ argc = jsuint(lenval.toInt32()); >+ } else { >+ JS_STATIC_ASSERT(sizeof(jsuint) == sizeof(uint32_t)); >+ if (!ValueToECMAUint32(cx, lenval, (uint32_t *)&argc)) >+ THROWV(0); >+ } Since this logic is duplicated from js_fun_apply, perhaps it could be factored out into a function called by both. >+ /* XXX: return -1 to fall back to slow path? */ >+ if (!stack.ensureSpace(cx, f.regs.sp, argc)) >+ THROWV(0); ensureSpace will report an error if it returns false, so you are correct in propagating the error. >+ for(uint i=0; i<argc; i++) { >+ /*XXX: can this happen? >+ if (argsobj) { >+ if (argsobj->getArgsElement(arg).isMagic(JS_ARGS_HOLE)) >+ return argsobj->getProperty(cx, id, vp); >+ }*/ >+ f.regs.sp[i] = fp->canonicalActualArg(i); >+ } It is faster to call fp->forEachCanonicalActualArg(op) than calling fp->canonicalActualArg(i) for each arg. Also, I think the XXX condition can happen, a la: function f() { delete arguments[1]; g.apply(null, arguments); } In the case of a hole, though, you will want to do what js_fun_apply does, which is to the corresponding stack arg to undefined. Addressing both points (and hoisting CopyTo and CopyNonHoleArgs into a shared header, say, jsinterpinlines.h), you can simply write: if (argsobj) fp->forEachCanonicalActualArg(CopyNonHoleArgs(argsobj, regs.sp)); else fp->forEachCanonicalActualArg(CopyTo(regs.sp));
Reporter | ||
Comment 23•14 years ago
|
||
Thanks for the feedback guys. The Compartmental-GC patch was a 100 ms. win on v8-raytrace. I suppose that a big part of this is speeding up allocating/freeing the arguments object. This reduces the expected perf win from this patch - 200 ms to 150 ms, iirc.
Reporter | ||
Comment 24•14 years ago
|
||
This addresses David's comments and adds the compile-and-go optimization. I hope I got it right... I will address Luke's comments later this week. Factoring out arguments/apply logic sounds good though.
Attachment #477252 -
Attachment is obsolete: true
Reporter | ||
Comment 25•14 years ago
|
||
Forgot StubCalls.cpp + check for js_fun_apply at compile time.
Attachment #479138 -
Attachment is obsolete: true
Reporter | ||
Comment 26•14 years ago
|
||
Update to new call machinery. If I run V8 in the browser, v8-raytrace goes from 1185 to 2234 points (average of 5 runs). Total score from 2274 to 2518 (= 244 points). I still need to address Luke's comments.
Attachment #479148 -
Attachment is obsolete: true
Delicious! fun_apply bake-in looks good. Some drive by comments: dataReg comparison should be branchPtr, against the ImmPtr(funobj). I don't think it's safe to ignore the top 17 bits on x64. js_GetClassPrototype and GetProperty are fallible, so failure there has to propagate Compile_Error all the way up (yeah, gross). This is a pre-existing bug, but the compiler should pass scopeChain to js_GetClassPrototype instead of NULL. This is because a debug-mode recompilation can give the compiler a different frame than cx->fp(), and NULL grabs the scope off the current frame.
Reporter | ||
Comment 28•14 years ago
|
||
Thanks for the feedback. Unfortunately, I'm AFK the coming days, I will try to finish this patch monday/tuesday. (as always, feel free to steal if you want it in sooner)
Reporter | ||
Comment 29•14 years ago
|
||
This fixes all comments so far. I removed an if in js_fun_apply (because ValueToECMAUint32 special-cases it too) to make it more consistent with stubs::PushArguments. @dvander, is there a better way to test the value? I added an #ifdef for now, this works on 32 and 64 bit.
Attachment #480387 -
Attachment is obsolete: true
Attachment #480987 -
Flags: feedback?(lw)
Attachment #480987 -
Flags: feedback?(dvander)
Reporter | ||
Comment 30•14 years ago
|
||
There is a problem with this patch if you override arguments.length or replace the arguments object (arguments = ...). Should the fast path support this? It's possible, but I'm afraid this will add/duplicate quite some code. Otherwise, we can bake in a check before calling the stub. Or return -1 from stubs::PushArguments and if that happens jump to the slow case. Any suggestions?
Comment 31•14 years ago
|
||
(In reply to comment #30) > There is a problem with this patch if you override arguments.length or replace > the arguments object (arguments = ...). Should the fast path support this? I would say it's not important to make that case fast, unless there is a current important benchmark that does that. But I don't think they do. I think web frameworks do things like that sometimes, but hopefully not in a hot loop.
Reporter | ||
Comment 32•14 years ago
|
||
Rebease + fix overridden arguments and arguments.length. Let me know if there is a better solution, but this seems to work well.
Attachment #480987 -
Attachment is obsolete: true
Attachment #481047 -
Flags: feedback?(lw)
Attachment #481047 -
Flags: feedback?(dvander)
Attachment #480987 -
Flags: feedback?(lw)
Attachment #480987 -
Flags: feedback?(dvander)
Comment 33•14 years ago
|
||
Pretty sure dromaeo includes prototype stuff in hot loops that might do evil things with arguments....
Assignee | ||
Comment 34•14 years ago
|
||
Comment on attachment 481047 [details] [diff] [review] WIP v10 Thanks for making the changes. >--- a/js/src/jscntxt.h Tue Oct 05 10:49:27 2010 -0700 >+++ b/js/src/jscntxt.h Tue Oct 05 23:25:04 2010 +0200 >+ public: > inline bool ensureSpace(JSContext *maybecx, Value *from, ptrdiff_t nvals) const; >- >+ private: public: inline bool ensureSpace(... private: >+ CopyNonHoleArgs(JSObject *aobj, js::Value *dst) : aobj(aobj), dst(dst) {} >+ JSObject *aobj; >+ js::Value *dst; >+ void operator()(uintN argi, js::Value *src) { >+ if (aobj->getArgsElement(argi).isMagic(JS_ARGS_HOLE)) { >+ dst->setUndefined(); >+ } >+ else >+ *dst = *src; >+ ++dst; Could you remove the curlies around dst->setUndefined. Also could CopyTo and CopyNonHoleArgs go in jsinterpinlines.h (perhaps right below forEachCanonicalActualArg) inside namespace js? >+int JS_FASTCALL >+stubs::PushArguments(VMFrame &f) >+{ [snip] >+ f.regs.sp += argc; >+ return argc; >+} Normally, jit code knows the static offset of sp from fp, so I wouldn't think bumping f.regs.sp was necessary. Is there a special dependency on f.regs.sp here?
Reporter | ||
Updated•14 years ago
|
Summary: JM: make JSOP_ARGUMENTS fast → JM: make f.apply(x, arguments) fast
Comment 36•14 years ago
|
||
I just want to add some numbers that show how important this bug is. On this line of code: new Flag.RayTracer.Vector(...) we are slower than V8 by 0.43ms/(1000 calls) (all numbers are on a Mac Pro). I confirmed that the property accesses (Flag.RayTracer and Flag.RayTracer.Vector) are not the issue. We are a bit slower on new Object, by 0.043ms/(1000 calls), so that's 10% of the loss. But most of it is probably the apply. The benchmark makes 58268 calls to Class.Create() (specifically, a closure returned by that call), so leaving out the |new Object| slowdown, we are losing about 23 ms here. This is on a benchmark that does 1 cycle in 15 ms in V8 and 48 ms in SpiderMonkey. So this is 2/3 of the time gap.
Comment 37•14 years ago
|
||
I made a mistake in counting calls to Class.create() in comment 36. There are actually 69372 calls, so the speedup from this patch is expected to be 27 ms. And we are losing 3 ms on the basic |new Object| functionality. So there are only 3 ms of other losses to be found.
Reporter | ||
Comment 38•14 years ago
|
||
Rebased, addressed luke's comments. @luke: I removed the "f.regs.sp += argc" line (it has no effect but I added it to match the other stubs) Can you help me get this in?
Attachment #481047 -
Attachment is obsolete: true
Attachment #482611 -
Flags: review?(lw)
Attachment #481047 -
Flags: feedback?(lw)
Attachment #481047 -
Flags: feedback?(dvander)
Reporter | ||
Comment 39•14 years ago
|
||
Oops, this is the correct version.
Attachment #482611 -
Attachment is obsolete: true
Attachment #482623 -
Flags: review?(lw)
Attachment #482611 -
Flags: review?(lw)
Assignee | ||
Comment 40•14 years ago
|
||
@jandem: Sorry for the silence. I was considering how this patch would best interact with other Function.call/apply optimizations (bug 602129 and bug 602262) and the possibility of reusing the callIC instead of using uncached calls for all these. I'll play around with your path and see if the idea works. Whether it does or doesn't, I am happy to help get this in.
Assignee | ||
Comment 41•14 years ago
|
||
Update: the call IC seems to be working out. The work in bug 602129 will be a pre-requisite to this patch.
Assignee | ||
Comment 42•14 years ago
|
||
Patch is lightly tested and applies on top of bug 602129. Measuring just this patch (-m in the ss harness): raytrace: 1.64x as fast 233.5ms +/- 0.5% 142.2ms +/- 0.9% Combined (i.e., compared to trunk): raytrace: 2.12x as fast 299.5ms +/- 0.7% 141.0ms +/- 0.3%
Assignee: jandemooij → lw
Assignee | ||
Comment 43•14 years ago
|
||
Attachment #482623 -
Attachment is obsolete: true
Attachment #485436 -
Attachment is obsolete: true
Attachment #482623 -
Flags: review?(lw)
Assignee | ||
Comment 44•14 years ago
|
||
Green on try.
Attachment #485464 -
Attachment is obsolete: true
Attachment #486281 -
Flags: review?(dvander)
Assignee | ||
Updated•14 years ago
|
Attachment #486281 -
Attachment is patch: true
Attachment #486281 -
Attachment mime type: application/octet-stream → text/plain
Reporter | ||
Comment 45•14 years ago
|
||
Wow, this is way better than my hacks ;) I guess the call ICs did help quite a bit? 2.12x is more than I got IIRC.
Comment on attachment 486281 [details] [diff] [review] patch >+ JS_ASSERT_IF(applyTricks == NoApplyTricks, frame.stackDepth() == opinfo->stackDepth); Instead of this, >+ applyTricks = LazyArgsObj; Move the other frame.pushSynced() outside of the else branch, and pop the placeholder when you don't want it anymore. That way the compiler state is more coherent outside of the apply trick. > void >-mjit::Compiler::checkCallSpeculation(uint32 argc, >+mjit::Compiler::checkCallSpeculation(uint32 callImmArgc, uint32 origArgc, Nit: It wasn't obvious at first what "origArgc" was. It seems like the argc in the bytecode should be "origArgc" and the speculated argc should be something else, like "specialArgc". > #ifdef JS_MONOIC > if (origCallee->isConstant() || origCallee->isNotType(JSVAL_TYPE_OBJECT) || debugMode) { > #endif >- emitUncachedCall(argc, callingNew); >+ JS_ASSERT(applyTricks == NoApplyTricks); Okay, so this shouldn't fire because something like 3.apply(5, arguments) will deoptimize, even for strings where JM still assumes primitives need to be wrapped? >+ JS_ASSERT(GET_ARGC(f.regs.pc) == 2); >+ >+ /* >+ * The lazyArgsObj flag indicates an optimized call |f.apply(x, arguments)| >+ * where the args obj has not been created (so regs.sp[-1] is uninitialized). >+ */ Explain the state of f.regs.sp before and after, here, for each case. >+ /* Can't optimize; just create an args obj and take the normal path. */ >+ f.regs.sp++; Add a comment here about why f.regs.sp needs to be bumped. In stubs we tend to bump sp where &sp[-X] would suffice, so it's best to give an explicit reason in cases like this.
Attachment #486281 -
Flags: review?(dvander) → review+
Assignee | ||
Comment 47•14 years ago
|
||
> Move the other frame.pushSynced() outside of the else branch, and pop the > placeholder when you don't want it anymore. That way the compiler state is more > coherent outside of the apply trick. Awesome observation > Okay, so this shouldn't fire because something like 3.apply(5, arguments) will > deoptimize, even for strings where JM still assumes primitives need to be > wrapped? My reasoning was: JSOP_FUNAPPLY is only emitted if you have x.apply, which means JSOP_CALLPROP, and JSOP_CALLPROP seemed to push non-constant callee/this. But, it seems like this could change in the future, so I'll handle the case, good catch!
Assignee | ||
Comment 48•14 years ago
|
||
http://hg.mozilla.org/tracemonkey/rev/92af3359a18f
Reporter | ||
Comment 49•14 years ago
|
||
Landed last month on mozilla-central: http://hg.mozilla.org/mozilla-central/rev/92af3359a18f
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•